Introduction to Big Data with High-Performance Computing
In this workshop, we will dive into utilizing Big Data techniques with High-Performance Computing Resources 🖥️ 💥
We will cover:
🔍 General concepts of Big Data
🧪 Simple examples on Hoffman2
💌 Suggestions and feedback: email cpeterson@oarc.ucla.edu
This presentation and accompanying materials are available in the UCLA OARC GitHub repository 📚
View slides in:
📄 PDF format - BigDataHPC.pdf
🌐 HTML format: Workshop Slides
You may want to install Spark/Dask now to follow along - INTRO.md
Note: 🛠️ This presentation was built using Quarto and RStudio. Quarto file:
BigDataHPC.qmd
The term Big Data refers to datasets, data processing, data modeling, and other data science tasks that become too large and complex for traditional techniques 💥
Are you working with data? Then, absolutely YES! 🎉
Big Data provides solutions for diverse research areas, scaling up research to new heights! 🚀
Projects with lots of DATA come with their own set of challenges 😰
High-Performance Computing (HPC) resources can supercharge 🚀 solving Big Data challenges by providing far more computing power than a typical workstation 💪.
Big Data is often characterized by:
Value: The potential insights and ‘worth’ that can be extracted from the data. 💡
Veracity: The reliability, authenticity, and overall quality of the data. Includes handling missing values, data imputation, etc. 👍
Variability: The adaptability of data in different formats, sources, and alignment with current data science methods. Raw and unstructured data can be tricky to manage. 🔃
💡 Understanding your data can help you make informed decisions on which Big Data techniques to employ.
Image source - DASK https://ml.dask.org/index.html
Various frameworks, APIs, and libraries for Big Data projects:
RDD - Resilient Distributed Dataset
Spark supports different levels of persistence for performance optimization.
Along with RDDs, Spark also has an API for DataFrames (similar to Pandas).
SparkSession is the entry point for using the DataFrame and Dataset API.
spark = SparkSession.builder \
.appName("MyPySpark") \
.config("spark.driver.memory", "15g") \
    .getOrCreate()
SparkContext is the entry point for creating RDDs.
The SparkContext can be accessed from the spark object created by SparkSession.builder (spark.sparkContext).
The easiest way to install PySpark is with anaconda3.
This is great when running PySpark on a single compute node.
module load anaconda3
conda create -n mypyspark openjdk pyspark python=3.9 \
pyspark=3.3.0 py4j jupyterlab findspark \
h5py pytables pandas \
-c conda-forge -y
conda activate mypyspark
pip install ipykernel
ipython kernel install --user --name=mypyspark
This will create a conda env named mypyspark, with access to Jupyter
This conda env will have both Spark and PySpark installed
Note
Information on using Anaconda can be found in a previous workshop
Let’s practice basic PySpark functions with examples.
Spark_basics.ipynb from spark-ex1
cd $SCRATCH
git clone https://github.com/ucla-oarc-hpc/WS_BigDataOnHPC
cd WS_BigDataOnHPC
cd spark-ex1
We will download “The Hound of the Baskervilles”, by Arthur Conan Doyle
We will use the h2jupynb script to start Jupyter on Hoffman2
You will run this on your LOCAL computer.
wget https://raw.githubusercontent.com/rdauria/jupyter-notebook/main/h2jupynb
chmod +x h2jupynb
#Replace 'joebruin' with your Hoffman2 user name
#You may need to enter your Hoffman2 password twice
python3 ./h2jupynb -u joebruin -t 5 -m 10 -e 2 -s 1 -a intel-gold\\* \
-x yes -d /SCRATCH/PATH/WS_BigDataOnHPC/spark-ex1
Note
The -d option of h2jupynb needs the full $SCRATCH/WS_BigDataOnHPC PATH to the workshop directory
This will start a Jupyter session on Hoffman2 with ONE entire intel-gold compute node (36 cores)
More information on the h2jupynb can be found on the Hoffman2 website
This example will use Spark’s Machine Learning library (MLlib)
We will use data from the Million song subset
This subset has ~500,000 songs with:
Download the dataset
We will use multiple nodes to run Spark
In the previous example, we used pyspark with 1 (36-core) compute node.
To do this, we will NOT use the Spark installation from our conda environment; instead, we will use a pre-built Spark downloaded from the Spark website.
mkdir -pv $SCRATCH/WS_BigDataOnHPC/apps/spark
cd $SCRATCH/WS_BigDataOnHPC/apps/spark
wget https://archive.apache.org/dist/spark/spark-3.3.0/spark-3.3.0-bin-hadoop3.tgz
tar -vxf spark-3.3.0-bin-hadoop3.tgz
Note
Though we will not use the Spark from conda, we will still use the PySpark package that was installed with conda. The Spark and PySpark versions must match (3.3.0 in this example)
Since we are using the Spark build we just downloaded, we will start Spark by submitting it as a job, then connect to Jupyter.
spark-ex2
pyspark-multi-jupyter.job
In this example, we will use 3 compute nodes in total.
Tip
For large data jobs, I like to have the Spark driver be separate from the workers.
Large data jobs may place a heavy CPU and memory load on the Spark driver.
The spark-test.JOBID file will display the MASTER node name
Run this ssh -L command on your LOCAL computer
# Replace NODENAME with the name of the MASTER node
# Replace joebruin with your Hoffman2 user name
ssh -L 8888:NODENAME:8888 joebruin@hoffman2.idre.ucla.edu
This will create an SSH tunnel to the master compute node so we can open Jupyter at http://localhost:8888
Then we can open the notebook named MSD.ipynb
Spark has a visual dashboard for viewing tasks in real time
By default, Spark runs this dashboard on port 4040
Create an SSH tunnel to the compute node to view the dashboard on your local machine
You will need to replace NODENAME with the master compute node that has your Spark job
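For example, the tunnel command looks like this (NODENAME and joebruin are placeholders):

```shell
# Replace NODENAME with the master compute node running your Spark job
# Replace joebruin with your Hoffman2 user name
ssh -L 4040:NODENAME:4040 joebruin@hoffman2.idre.ucla.edu
# Then open http://localhost:4040 in your local browser
```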
You can run Spark non-interactively as a batch job.
Use spark-submit to start the PySpark calculation; it is located at:
$SPARK_HOME/bin/spark-submit
I have another Machine Learning example for Spark that I may not have time to go over in this workshop.
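Going back to batch mode, a minimal sketch of a spark-submit invocation looks like this (the script name myscript.py and the resource settings are placeholder assumptions, not the workshop's actual job script):

```shell
# Run a PySpark script non-interactively; myscript.py is a placeholder
$SPARK_HOME/bin/spark-submit \
    --master local[4] \
    --driver-memory 8g \
    myscript.py
```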
In this example, we will train a Machine Learning model using data from LIBSVM
spark-bonus
ML-bonus.ipynb
Dask has an Arrays API created from NumPy-like chunks
Image source - https://docs.dask.org/en/stable/
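A minimal sketch of a chunked Dask array, assuming dask is installed (the array shape and chunk size are arbitrary examples):

```python
import dask.array as da

# A 1000x1000 array of ones, split into 100x100 chunks
x = da.ones((1000, 1000), chunks=(100, 100))

# Operations build a lazy task graph; .compute() runs it across the chunks
total = x.sum().compute()
print(total)  # 1000000.0
```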
Dask DataFrames are Pandas-like objects and are composed of Pandas-like “chunks”.
Image source - https://docs.dask.org/en/stable/
module load anaconda3
conda create -n mydask python pandas jupyterlab joblib seaborn \
dask dask-ml nodejs graphviz python-graphviz \
-c conda-forge -y
conda activate mydask
pip install ipykernel
ipython kernel install --user --name=mydask
This will create a conda env named mydask with Dask, Dask-ML, and Jupyter installed
We will use the h2jupynb script to start Jupyter on Hoffman2
You will run this on your LOCAL computer.
python3 ./h2jupynb -u joebruin -t 5 -m 10 -e 2 -s 1 -a intel-gold\\* -x yes \
-d /SCRATCH/PATH/WS_BigDataOnHPC/dask-ex1
Replace joebruin with your Hoffman2 user account.
Replace /SCRATCH/PATH/WS_BigDataOnHPC with the full PATH name of the workshop on Hoffman2
dask-ex1
dask_basic.ipynb
Dask has a Dask-ML library with scalable Machine Learning methods. There is also integration with:
This example will use Scikit-Learn with Dask
We will use data from the Million song subset
cd $SCRATCH/WS_BigDataOnHPC/dask-ex2
wget https://archive.ics.uci.edu/ml/machine-learning-databases/00203/YearPredictionMSD.txt.zip
unzip YearPredictionMSD.txt.zip
Replace joebruin with your Hoffman2 user account.
Replace /SCRATCH/PATH/WS_BigDataOnHPC with the full PATH name of the workshop on Hoffman2.
python3 ./h2jupynb -u joebruin -t 5 -m 10 -e 2 -s 1 -a intel-gold\\* -x yes \
-d /SCRATCH/PATH/WS_BigDataOnHPC/dask-ex2
MSD-dask.ipynb
Dask has a visual dashboard for viewing tasks in real time
You will need to replace NODENAME with the compute node that has your Dask job
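As a hedged sketch of the Scikit-Learn + Dask integration this example relies on, here is a minimal local version using joblib's dask backend. The toy classification data stands in for the Million Song features, and the in-process Client stands in for a real multi-node cluster:

```python
import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# A local, in-process Dask "cluster" stands in for a multi-node one
client = Client(processes=False)

# Made-up toy data in place of the Million Song features
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

search = GridSearchCV(SVC(), {"C": [0.1, 1.0, 10.0]}, cv=3)

# Hand the parallel model-selection work to Dask through joblib
with joblib.parallel_backend("dask"):
    search.fit(X, y)

best_C = search.best_params_["C"]
client.close()
```

On Hoffman2, the same pattern applies, with the Client pointed at the Dask workers running on the compute nodes instead of a local one.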
Optimize your resources for seamless project execution! 💪📈